    The Amanuensis Speech Recognition System, under development at the
Stanford Artificial Intelligence Laboratory, attempts to extract the
maximum linguistic information from the acoustic signal of continuous
speech. While it is now generally recognized that a complete speech
understanding system must make use of many forms of knowledge
(syntactic, semantic, and contextual) in addition to the information
contained in the actual acoustic waveform, it is also obvious that the
overall performance of a speech understanding system depends critically
upon the use that is made of the acoustic input.

    It is our belief that recent developments in computer hardware, and
in our ability to make effective use of this hardware, have reopened the
question of the extent to which the acoustic input can be utilized, and
that some expanded effort along these lines is warranted. Accordingly,
we have directed our efforts to what might be called the front end. We
are attempting to demonstrate the extent to which acoustic information
alone can be used in solving the general speech recognition problem. At
the same time, the work of others will demonstrate the extent to which
higher sources of knowledge can compensate for deficiencies in the
acoustic wave and in our ability to abstract significant linguistic
information from it. Ultimately the two approaches can be combined to
achieve a level of performance that could not be achieved with either
approach alone.

    The Amanuensis approach differs from earlier acoustic systems, and
from the front-end approaches currently being used by the rest of the
ARPA community, in a number of important respects. In the first place,
we make no simplifying assumptions regarding the uniqueness of a
phonemic event. As is well known, the phonemes of real speech are not
isolated phonetic events but are manifested by clues which overlap and
extend for some distance from the central region which might arbitrarily
be assigned to a given phoneme. Furthermore, any one phoneme can and
does occur in a variety of allophonic variations, which themselves are
seldom pronounced the same way, even by the same speaker and in the same
utterance. In continuous speech, many of the clues which establish the
identity of a given allophone are modified by the environment of the
allophone, and some of them may be completely missing.

    We attempt to deal with these complications by utilizing redundant
sets of clues, obtained both from the steady or nearly steady portions
of the waveform and from the transition regions. We handle the mass of
data which this approach generates by using signature tables to
correlate these data with significant features of the utterance and
ultimately with the phonemic intent. The required multimodal
relationships are established by means of training sessions and are
expressed as probability values stored in the tables. Probability values
are retained, not only for the most probable choice for each phoneme,
but also for alternate choices. The higher-level portions of a complete
speech understanding system can then select alternate choices on a
probabilistic basis whenever the first choices are found to be
unreasonable in the light of syntactic, semantic, or linguistic
constraints.
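    To make the signature-table idea more concrete, the sketch below shows
one simple way such a table could be organized: each cell is indexed by a
small tuple of quantized acoustic clues, and it stores a probability for
every phonemic choice that has been associated with that signature, so that
alternate choices remain available to the higher levels of the system. The
sketch is illustrative only; the names, the quantization, and the
single-level organization are assumptions made for the example and do not
describe the actual Amanuensis tables.

# Illustrative sketch of a single-level signature table.  A "signature" is
# a tuple of quantized acoustic clues; each cell holds probability values
# for the phonemic choices observed with that signature during training.
class SignatureTable:
    def __init__(self):
        self.cells = {}          # signature -> {phoneme: probability}

    def lookup(self, signature):
        # Return (phoneme, probability) pairs, most probable first.
        # Alternate choices are retained so that higher-level components
        # can fall back on them when the first choice proves unreasonable.
        cell = self.cells.get(signature, {})
        return sorted(cell.items(), key=lambda kv: kv[1], reverse=True)

# Example with hypothetical clue values and probabilities:
table = SignatureTable()
table.cells[(2, 1, 3)] = {'S': 0.7, 'Z': 0.2, 'TH': 0.1}
choices = table.lookup((2, 1, 3))
# choices[0] is the first choice; choices[1:] are the retained alternates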
    A second important difference between our work and that currently
being done elsewhere has to do with our use of machine learning
techniques. These techniques enable us both to establish the desired
relationships between acoustic clues and the phonemic intent of the
speaker and to compensate for differences between speakers. Some of the
relationships between the available acoustic input parameters and their
phonemic interpretation are not at all obvious, and all too often we
have found that our a priori evaluations were quite incorrect, even when
these were based on our acoustic input. Likewise, our intuitive
evaluations based on knowledge of earlier work have often proven to be
grossly in error.
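    As one illustration of how the probability values in a signature table
might be established during a training session, the sketch below simply
counts how often each signature occurs with each phonemic intent and then
normalizes the counts into probabilities. This count-and-normalize scheme,
and the idea of rerunning it on material from a new speaker to compensate
for speaker differences, are assumptions made for the example rather than a
description of the actual Amanuensis training procedure.

from collections import defaultdict

# Illustrative count-and-normalize training for a signature table.
# observations: iterable of (signature, phoneme) pairs whose phonemic
# intent is known from a training session.
def train_cells(observations):
    counts = defaultdict(lambda: defaultdict(int))
    for signature, phoneme in observations:
        counts[signature][phoneme] += 1
    cells = {}
    for signature, phoneme_counts in counts.items():
        total = sum(phoneme_counts.values())
        cells[signature] = {ph: n / total for ph, n in phoneme_counts.items()}
    return cells

# Example: two occurrences of 'S' and one of 'Z' with the same signature
# yield probabilities of about 0.67 and 0.33 respectively.
cells = train_cells([((2, 1, 3), 'S'), ((2, 1, 3), 'Z'), ((2, 1, 3), 'S')])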
    While the Amanuensis approach is envisioned as a useful part of a
complete speech understanding system, it also provides a study tool
which can be used to arrive at a better evaluation of the usefulness of
different acoustic clues. In a final understanding system these clues
might be extracted by the same or alternative computer methods, or they
might better be extracted by specially constructed hardware.